universe must be unambiguously defined but it is not necessary to assume that
[...] covariances. A study of generalizability is conducted by taking measurements
on persons, stimuli, tasks, etc. that are assumed to be randomly representative
of a universe an investigator wishes to generalize to. The ratio of an
estimate of the universe "score" variance to an estimate of the observed
score variance is the coefficient of generalizability. This is estimated by
the intra-class correlation coefficient. ANOVA and the Expected Mean Square
paradigm of Cornfield and Tukey are used to obtain the appropriate variance
estimates.

The  theory  dispenses  with  unnecessary  and  unwarranted  assumptions,  and 
eliminates  the  distinction  between  reliability  and  validity.  Any  generaliz- 
ability  study  can  be  conducted  without  reference  to  having  a parallel  measure 
of  the  MAU  instrument  or  some  external  criterion  of  "success".  If  a MAU 
technique  is  compared  to  some  non-MAU  technique  for  doing  the  same  thing  then 
it  is  possible  to  calculate  the  coefficient  of  generalizability  for  both 
methods  thus  allowing  the  investigator  to  decide  which  is  best  for  his  or  her 
purposes. Three numerical examples are given of the theory. Preliminary
investigations have indicated that MAU models and techniques based on such
models may be better than non-MAU models since the former have a tendency to
reduce  the  interaction  between  judges  and  the  thing  being  judged  when  such 
interaction  represents  inconsistency  of  judgment. 
Assessing the Reliability and Validity
of Multi-Attribute Utility Procedures:

An Application of the Theory of Generalizability
J.  Robert  Newman 

Social  Science  Research  Institute 
University  of  Southern  California 

Most important decisions involve choosing among alternatives with
multiple value characteristics. For example, in deciding what home to buy,
some of the value relevant characteristics might be: number of rooms, price,
location, potential as an investment, and so on. A set of multi-attribute
utility (MAU) models and procedures has been proposed as an aid in making
such decisions (Edwards, 1971; Raiffa, 1968, 1969).

The  basic  idea  of  a MAU  procedure  is  to  "Divide  and  Conquer"  (Raiffa, 
1968,  p.271).  There  are  three  basic  steps  in  this  process.  First,  the 
decision  problem  is  broken  up  into  little  pieces  (attributes)  along  natural 
lines  depending  upon  the  nature  of  the  task.  Second,  separate  judgments  are 
made  about  each  of  the  component  pieces.  As  a rule,  there  are  two  such 
judgments,  numerical  judgments  about  the  importance  of  each  attribute  relative 
to  each  other  and  numerical  judgments  about  the  "worth"  or  utility  of  each 
attribute  to  each  of  the  competing  decision  alternatives.  Finally,  these 

separate  judgments  are  aggregated  using  some  formal  algebraic  rule  and  this 
is  used  as  an  aid  to  the  final  decision. 
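
The aggregation step can be illustrated with a minimal sketch. It assumes the weighted additive rule, which is only one possible "formal algebraic rule" (it is the form of the decomposed additive utility models discussed later in this paper). The attribute names, importance weights, and utilities are hypothetical values chosen for the home-buying illustration, and `additive_utility` is an illustrative name, not part of any published MAU procedure.

```python
from typing import Dict


def additive_utility(weights: Dict[str, float], utilities: Dict[str, float]) -> float:
    """Aggregate attribute utilities with importance weights normalized to sum to 1."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("importance weights must sum to a positive number")
    return sum(weights[a] / total * utilities[a] for a in weights)


# Hypothetical importance judgments for the home-buying example.
weights = {"rooms": 20, "price": 40, "location": 30, "investment": 10}

# Hypothetical 0-100 utility judgments of each attribute for two alternatives.
homes = {
    "home_a": {"rooms": 60, "price": 80, "location": 50, "investment": 70},
    "home_b": {"rooms": 90, "price": 40, "location": 70, "investment": 60},
}

scores = {name: additive_utility(weights, u) for name, u in homes.items()}
best = max(scores, key=scores.get)
```

With these made-up numbers the aggregate scores are 66 for `home_a` and 61 for `home_b`, so the rule would point to `home_a` as an aid to the final decision.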

Advocates of MAU procedures have offered them as a replacement for
"wholistic" procedures in which the decision maker forms an overall intuitive
evaluation of a decision alternative. Such advocates have also argued that
MAU procedures are "better" in the sense that they are more reliable and
valid than wholistic procedures. It is this comparison that forms the
motivation behind this paper, which stems from a disillusionment with the way
multi-attribute utility (MAUM) models and procedures are being assessed as to their
reliability and validity. I would like to suggest a reformulation that may
resolve a good many conceptual and practical difficulties. The reformulation
is based on some work of Lee Cronbach and his associates (Cronbach, et al.,
1972) and is called the Theory of Generalizability. Before describing what this
theory is all about and why I think it may be of considerable use in studies of
MAU models, let me first give an example of the conceptual difficulties that
the classical theory can lead to.

Consider the concept of validity. Under the classical theory validity
(sometimes called predictive validity, concurrent validity or convergent
validity) is defined as the relation between the measuring instrument and other
criteria of "success". If, for example, you had a method for assessing the
amount of anxiety in a person via a paper and pencil test and could demonstrate
that such scores correlated highly with an independent physiological measure of
anxiety (e.g., palmar sweating) then you could argue that the paper and pencil
test did indeed have validity from a psychometric standpoint which also makes
sense from a behavioral standpoint. The basic references on the classical theory
of validity are Gulliksen (1950) and Lord and Novick (1968). Now consider how
MAU techniques have been validated. Fisher (1972), Huber, Daneshgar, and Ford
(1971), and more recently Gardiner (1974) used as the validating criterion for
various MAU techniques wholistic judgment in certain decision situations. Each
of these investigators demonstrated that decomposed additive utility models
(MAU techniques) correlated highly with intuitive wholistic judgments in
judgment tasks and used these results to argue that the additive utility models
were therefore valid in the sense that they were capable of predicting a
criterion, namely, the wholistic judgments. There is an obvious error in logic
here. If the decomposed additive utility models are supposed to be "better"
than wholistic judgments, why use wholistic judgment as the validating criterion
for the MAU technique?

There are other difficulties. It can be demonstrated that wholistic
judgments are not as reliable as decomposed judgments (Fisher, 1972; Gardiner,
1974). Therefore we are using a less reliable criterion to provide evidence for
predictive validity for a more reliable MAU technique. Aside from an apparent
error of logic here, consider what the classical theory of reliability and
validity has to say about such a situation. There are two cases to consider:

Correction for an unreliable criterion

If, according to the classical model, you have a reasonably reliable
measuring technique and you are using a fallible criterion to assess its validity
then it is logically unfair to make it appear that the measuring technique is
less valid than it really is. It is desirable, therefore, to correct predictive
validity coefficients for attenuation in the criterion measurement (Gulliksen,
1950). The formula for such correction is

$$r'_{xy} = \frac{r_{xy}}{\sqrt{r_{yy}}} \qquad (1)$$

where $r'_{xy}$ is the corrected validity coefficient, $r_{xy}$ is the correlation between
the measuring instrument x and the criterion y, and $r_{yy}$ is the reliability of
the criterion y. (Note: the expression $\sqrt{r_{yy}}$ is referred to as the index of
reliability.)

Gardiner (1974, p. 111) presents test-retest correlations (index
reliabilities) for wholistic judgments as having an average of .75 and validity
coefficients having an average of .66. This latter was the larger average of
the two calculations he did in correlating his Multi-Attribute Rating Scale
(MARS) with the intuitive wholistic judgments. Thus applying the above formula
using his results we have:

$$r'_{xy} = \frac{.66}{.75} = .88$$

This, of course, according to the classical theory, makes Gardiner's
MARS technique have appreciably higher validity than he reported.

Shrinkage of validity coefficients

Any validity coefficient, whether it be a correlation between a predictor
(measurement) and a criterion or whatever, is designed to yield the best possible
prediction for the sample of data on which it was developed. If the prediction
equation which the coefficient represents actually was applied to a new sample
of data the predictions invariably would be worse and the resulting validity
coefficient lower. This phenomenon is called shrinkage of the validity
coefficient. The amount of shrinkage is an indication of how much biased upwards
the original coefficient was. The classical theory has a formula for correcting
the original validity coefficient for this upward bias. The formula is

[...] (2)

where $\hat{r}_{xy}$ is the estimated corrected coefficient, $r_{xy}$ is the original coefficient,
and N is the number of observations in the original sample. When we apply this
formula to the results presented by Gardiner, who used a sample of size N = 14
subjects, and whose average validity coefficient was .66, we have

$$\hat{r}_{xy} = \left[\,1 - (1 - .66^2)\,(\ldots)\,\right]^{1/2} = .58$$

Thus we see the classical theory tells us to correct the coefficient upwards to
provide for a fallible criterion and to correct it downward to provide for the
overestimate due to sampling errors and capitalizing on chance in the
validating sample. The classical theory does not tell us which of these coefficients
is "best".

One possible solution is to use the classical theory to first reduce the
overestimated validity coefficient using formula (2) and then apply the
correction for attenuation formula (1) to adjust for a fallible criterion. If we do
this with Gardiner's data the obtained estimate is now .77.
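
The arithmetic behind the .88 and .77 figures can be checked with a short sketch. It assumes, as the text does, that the reported .75 is the index of reliability of the criterion (the square root of the criterion reliability), so that the correction for attenuation is a simple division. The function name is illustrative, and the .58 entered below is simply the shrunken coefficient reported above rather than something recomputed here.

```python
def correct_for_attenuation(r_xy: float, index_of_reliability: float) -> float:
    """Divide an observed validity coefficient by the criterion's index of reliability."""
    if not 0.0 < index_of_reliability <= 1.0:
        raise ValueError("index of reliability must lie in (0, 1]")
    return r_xy / index_of_reliability


INDEX_OF_RELIABILITY = 0.75  # average test-retest figure reported for wholistic judgments

# Correction applied to the validity coefficient as reported.
corrected_reported = correct_for_attenuation(0.66, INDEX_OF_RELIABILITY)

# Correction applied after the coefficient has first been shrunk to .58.
corrected_after_shrinkage = correct_for_attenuation(0.58, INDEX_OF_RELIABILITY)

print(round(corrected_reported, 2), round(corrected_after_shrinkage, 2))
```

The two results differ only in which coefficient is fed in, which is the source of the ambiguity the classical theory leaves unresolved.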

All of these coefficients make sense, if we adopt the classical theory
and use it as a guideline to assess the validity of the measuring instrument.
I think you will agree, however, that things can be a little confusing. There
should be a better way, and Cronbach and his associates have indicated that
there is indeed a better way which they call a Theory of Generalizability.

The Theory of Generalizability

Basic Concepts

To ask the question of how reliable or dependable a measure is, is to
ask how well one can generalize from the observation at hand to some universe
or domain of observations to which it belongs. To ask about rater agreement
in MAU studies is to ask how well we can generalize from one set of ratings to
ratings from all possible raters who might have been chosen to actually do the
rating in the particular study. The theory requires the investigator to specify
the universe of conditions of observation over which he wishes to generalize.
"Conditions" is a generic term referring to observers (raters), forms of stimuli,
occasions, etc. In addition to generalizing to a universe of raters, for example,
we may also wish to generalize to a universe of situations in which the ratings
were made. Miller, Kaplan, and Edwards (1968) studied the efficacy of a utility
model performing under four military logistical situations. It may be of
interest to know how well one could generalize from these four situations to all
possible situations which the particular four represented. Gardiner (1974)
used 15 "typical" housing development permit requests in his application of
MAU techniques to Coastal Zone Management decision making. It is of interest
to know how representative these 15 permit requests were and therefore how well
one can generalize to the universe of such permit requests.

Questions concerning generalizability are substantive, not just
methodological. They require thinking about the class of observations and not just
the measuring technique which gathered the observations at hand.

The following are requirements or assumptions of the theory:

(a) The universe is defined unambiguously. It must be clear what
conditions fall within the universe.

(b) Conditions are experimentally independent. For example, a person's
score or rating in one condition does not depend on the fact that he or she
has or has not been previously observed under other conditions.

(c) Conditions are randomly selected from the universe of conditions.
This assumption is crucial but no assumptions are made about the content of
the universe or about the statistical properties of the conditions within the
universe. The restrictive and unnecessary assumptions of the classical theory
such as uniform variances and covariances of two or more samples of items,
persons, etc. are eliminated.

If we wish to generalize to persons (raters) then for each person p, the
universe score $M_p$ is defined as the expected value $E(X_{pi})$ of the observed
score $X_{pi}$ over all conditions in the universe. If we wish to generalize to
situations then a universe situation mean is defined in a similar fashion. If
we define generically $X_c$ as the sample observation of some condition c and
$M_c$ as its expected value in the population, then we can define the squared
correlation

$$G^2_{X_c M_c} = \frac{\text{estimated universe score variance}}{\text{estimated observed score variance}}$$

as the coefficient of generalizability which indicates how well one can
generalize from the observed data to the universe score. This definition requires
$X_c$ and $M_c$ to be random variables. We will see shortly when we discuss estimates
of $G^2_{X_c M_c}$ that the intra-class correlation coefficient (Haggard, 1958) is a
lower bound of $G^2_{X_c M_c}$ and can be easily estimated from analysis-of-variance
(ANOVA) designs.

Note that this definition does not require an outside or independent
criterion against which to assess the dependability of the measuring technique.
Any study of the measuring technique will have its own generalizability. This
is equivalent to what some investigators have called the "external validity"
of the study (Campbell and Stanley, 1963). When a study has been completed
and a relation found between some independent and dependent variable then
questions of external validity refer to what populations this relation can be
generalized to, and how extensive this generalization is. This is contrasted
with internal validity, which refers to how precise (reliable) the study was in
the first place. It is possible, of course, to have highly precise experiments
that have little generality. The converse is not true. Experiments with low
internal validity are highly imprecise and thus cannot have much generality.
Campbell and Stanley note, and correctly so, that internal validity is the
sine qua non of a good research design but unless special cautions are taken,
the results of a carefully designed study are not representative and hence not
generalizable. The ideal design should be high in both internal and external
validity. The theory of generalizability meets this problem head on by
requiring the investigator to be very explicit about what universe he wishes to
generalize to and thus forcing him to design "representative" experiments with
the zeal advocated by Egon Brunswik (1956).

G and D Studies

The theory makes the distinction between generalizability studies (G
study) and decision studies (D study). The D study provides information from
which decisions about individuals, groups, and/or situations are made, while
the G study is used to assess the actual measuring technique. The design of
G and D studies may be one and the same but they are often different. The
distinction between G and D studies is more than a mere recognition of the fact
that certain studies are carried out during the development of a measuring
instrument and then the instrument is utilized in other studies for practical
purposes. The distinction is particularly crucial for anyone who advocates the
use of certain techniques which are claimed to be better than others. Such
claims usually can be demonstrated in laboratory-like studies but these
techniques may then be used to make very important and practical social judgments
such as Coastal Zone Management decisions. The distinction is particularly
important for MAU studies to clarify analyses of just how ratings are assessed
and used. In Gardiner's study of Coastal Zone decisions each subject (rater)
gave a utility judgment on all attributes and an importance weight on all
attributes. The intra-class correlation among raters (coefficient of
generalization), if it were calculated, would ignore differences in rater bias. This
would be the appropriate coefficient in a subsequent D study if the raters
used in that study also made utility and importance weight judgments on all
the attributes. However, if the raters in a subsequent D study differed on
what attributes they judged or, as might well be the case in practical
situations, different persons provided the utility and importance weight judgments,
then one would need to know the intra-class correlation that treats such things
as rater leniency, possible differences in giving utility judgments versus
importance weight judgments, and so on.

Generalizability and Construct Validity

We began this paper by being concerned about the reliability and validity
of multi-attribute utility techniques. We then argued that the classical
theories of reliability and validity are not satisfactory. Validity in
particular is suspect, primarily if, by validity, one means how well a measuring
instrument correlates with or predicts some external criterion. It is often the
case that this criterion is itself suspect either because of doubtful relevance
to what is really intended to be measured, or because the criterion itself may be
unreliable. Because of these difficulties, psychometricians have introduced
another definition of validity called construct validity. The basic idea of
construct validity is that any measuring instrument should have behavioral or
psychological meaning in terms of some useful psychological construct that it
purports to measure. The construct of anxiety, for example, is a useful and
highly valid construct since it can be demonstrated that several different
ways of measuring that construct (e.g., Manifest Anxiety scales, palmar sweating,
etc.) correlate reasonably well, and what is of even more importance, the
measurement of anxiety in people enables you to make differential predictions about
other behaviors for people who are located on different scale locations of the
anxiety scale. High anxiety individuals, for example, perform quite differently

in learning tasks than [...]. Closer to home, the construct [...] also [...]
construct validity since when [...] are measured in both animals and humans
one can make differential [...] (Greeno, [...]).

The theory of generalizability has implications for construct validity.
The theory requires an investigator to conduct a G study by defining a universe
of interest to him and then make observations under two or more selected
conditions within that universe. The calculations yielding one or more
coefficients [...] the observed scores [...] universe scores. The universe [...]
considered as a construct that [...] since [...] thinks it has explanatory
[...] predictive power. Thus [...] of generalizability [...] be an investigation
of construct validity. Thus, it is not necessary within this framework to
[...] a distinction between reliability and validity. This notion has been
recognized before by Tryon (1957) who introduced the idea of a domain sampling
model  in  which  a sample  of  items  in  any  test  could  be  considered  a random 
sample  of  all  items  in  the  domain  or  universe  of  items.  Tryon  also  pointed 
out  that  if  the  sampled  items  were  tapping  some  interesting  domain  of  behavior 
then  reliability  could  also  be  considered  as  behavior  domain  validity  and  it 
is  not  necessary  to  distinguish  between  the  concepts  of  reliability  and  validity. 

Although  they  use  different  philosophical  reasons,  Miller,  Kaplan  and 
Edwards  (1967)  have  also  recognized  that  the  distinction  between  reliability 
and  validity  is  useless  especially  when  one  is  concerned  with  decision  making 
systems.  They  introduce  the  concept  of  intellectual  coherence  which  they  equate 
with  construct  validity  for  decision  systems.  To  quote  these  authors: 

Validation is simply establishing the coherence of a procedure,
or several procedures. Thus no sharp line separates the concept
of reliability from that of validity; both concepts refer to
agreements among measures, and a continuum exists from cases in
which the measures essentially repeat the same procedure
(reliability) to cases in which rather different procedures seem to
measure the same thing (validity). (Miller, et al., 1967, p.48)

And  further  on,  ... 


We assert that no external measure of the performance of a
judgment-based decision-making system is possible. Any such measure
would have to compare the decisions the system made with decisions
made some other way, and there would have to be some good reason
to suppose that the decisions made the other way were right ones.
But if we reject the idea that the business of a decision-making
system is to imitate some individual's decisions (in which case
the only point of building the system would be to save the
individual the trouble of making those decisions himself), then no
basis remains for asserting that the decisions made by one
procedure (e.g., by the commander) are inherently appropriate simply
because they were made by that procedure, regardless of their
content. An examination of the merit of decisions in terms of
their content is a matter of intellectual coherence or reliability,
not validity.

We assert also that intellectual coherence or reliability is
very measurable and is in fact what we want the output of
decision-making system to have.

We are in strong agreement with these statements. Also, we believe that
"intellectual coherence or reliability" can be demonstrated using the Theory
of Generalizability.

Analysis of Variance (ANOVA) and Variance Components

The theory of generalizability in the conduct of both G and D studies
makes extensive use of ANOVA models which are more general and incorporate as
special cases the familiar correlational designs utilized by the classical
approach. ANOVA designs distinguish between random and fixed factors. A
random factor is one in which the levels of the factor are considered a sample
from a universe of all possible levels whereas a fixed factor exhausts all
levels of interest for that factor. It should be obvious from the previous
discussion that a coefficient of generalization for any condition of a G or D
study makes sense only if the levels of the factor for that condition are
random. A fixed factor in an ANOVA design exhausts all the levels of interest
for that factor and there is nothing to generalize to. However, many G and D
studies may employ both fixed and random factors (mixed designs).

The general procedure in conducting either a G or D study is to utilize
an ANOVA design which will then yield the familiar sums-of-squares and mean
squares. F tests, however, are rarely conducted since one is not
usually interested in testing hypotheses but rather in estimating various
expected values of the mean squares in the fashion suggested by Cornfield and
Tukey (1956). These estimates are then used to report the results of the study
in terms of the components of variance accounted for and an estimate of the
coefficients of generalizability. We will dispense with the formal theory
which is well presented in Cronbach, Gleser, Nanda and Rajaratnam (1972) and
any good ANOVA book such as Winer (1970) or Kirk (1968) and proceed to give
several numerical examples.

Examples

The first example uses fictitious data in a simple study of how to analyze
judgments of the importance of attributes as they might be obtained in a
typical MAU study. The second two examples are more complicated and use data from
actual experiments.

Example 1: Analysis of raters making importance judgments about attributes.

In MAU studies one task for the "expert" subjects is to make judgments
of importance for each of the attributes under consideration for the decision.
Suppose we have four experts rate each of six attributes on importance on a
10 point scale (1 = least important; 10 = most important). The result might
be like that reported in Table 1.

Insert Table 1 about here

Since we are interested in how well the rater might be doing at this task
we ask questions about the generalizability of the measuring instrument, i.e.,
the raters judging importance of attributes. The data in Table 1 are easily
analyzed by ANOVA with the results given in Table 2.

Insert Table 2 about here
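
As a rough sketch of the computation behind such an ANOVA table, the following assumes a two-way layout with one observation per cell (raters by attributes). The ratings are hypothetical values made up for illustration; they are not the data of Table 1, and the function and key names are likewise illustrative.

```python
from typing import Dict, List


def two_way_anova(ratings: List[List[float]]) -> Dict[str, float]:
    """Two-way ANOVA with one observation per cell (rows = raters, columns = attributes)."""
    n_raters = len(ratings)
    n_attrs = len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n_raters * n_attrs)
    rater_means = [sum(row) / n_attrs for row in ratings]
    attr_means = [sum(row[j] for row in ratings) / n_raters for j in range(n_attrs)]

    # Each attribute mean rests on n_raters values; each rater mean on n_attrs values.
    ss_attr = n_raters * sum((m - grand) ** 2 for m in attr_means)
    ss_rater = n_attrs * sum((m - grand) ** 2 for m in rater_means)
    # Residual: what is left in each cell after removing both main effects.
    ss_resid = sum(
        (ratings[i][j] - rater_means[i] - attr_means[j] + grand) ** 2
        for i in range(n_raters)
        for j in range(n_attrs)
    )

    df_attr, df_rater = n_attrs - 1, n_raters - 1
    df_resid = df_attr * df_rater
    return {
        "ss_attr": ss_attr, "ss_rater": ss_rater, "ss_resid": ss_resid,
        "df_attr": df_attr, "df_rater": df_rater, "df_resid": df_resid,
        "ms_attr": ss_attr / df_attr,
        "ms_rater": ss_rater / df_rater,
        "ms_resid": ss_resid / df_resid,
    }


# Hypothetical importance ratings: 4 raters (rows) by 6 attributes (columns).
ratings = [
    [9, 3, 7, 5, 2, 8],
    [8, 4, 6, 5, 3, 9],
    [7, 5, 8, 3, 4, 6],
    [10, 4, 7, 6, 2, 9],
]
table = two_way_anova(ratings)
```

For a 4 by 6 layout the degrees of freedom come out as 5, 3, and 15, and the three sums of squares add up to the total sum of squares about the grand mean.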

The expected mean squares E(MS) in the last column of Table 2 are the
population values of the sample variances (mean squares). For those readers
not familiar with expected mean squares the following intuitive explanation is
offered: each expectation consists of an error variance component, since all
Table 1

Importance Ratings Given by 4 Raters
on Each of 6 Attributes

[Table body not recoverable. Columns: Attributes 1 to 6 and the rater mean (R).]
Table 2

Analysis of Variance of Ratings

Source                    df    Sums of Squares    Mean Square    Expected Mean Square E(MS)
A (Between Attributes)     5        122.50            24.50       $\sigma_e^2 + 4\sigma_A^2$
B (Between Raters)         3         17.50             5.83       $\sigma_e^2 + 6\sigma_B^2$
AB (Residual)             15         18.50             1.23       $\sigma_e^2$
TOTAL                     23        158.50
single measurements contain error, plus a "weighted" variance component
of the assumed treatment effect, in this case the effect due to the rows (raters)
and that due to columns (attributes). Each treatment variance is weighted by
the number of sample values that contributed to each treatment level mean. For
example, the sum-of-squares due to attributes in Table 1 came from the fact
that each column (attribute) mean is deviating from the grand mean, with each
mean, however, representing the four values that went in to calculate it. Thus the
expected value component for the effect due to attributes ($\sigma_A^2$) gets weighted
by 4. In a similar fashion six values contributed to the mean for each row
effect and thus the expected value component for the effect due to raters gets
weighted by 6.

The general method for obtaining the expected mean squares E(MS) for
ANOVA designs is straightforward but involves tedious algebra. Fortunately,
Cornfield and Tukey (1956) have provided a convenient algorithm which enables
one to set down the appropriate expected mean squares for any ANOVA design.
This algorithm is explained in Kirk (1968) and Winer (1970). Since the
calculation of [...] for the theory of generalizability [...]
the variance component for factor A (attributes) is
$$\hat{\sigma}_A^2 = \frac{MS_A - MS_e}{4} = \frac{24.50 - 1.23}{4} = 5.82$$

where $\hat{\sigma}_A^2$ is an estimate of the variance component and $MS_A$ and $MS_e$ are the
mean squares for the A factor and error respectively. Since this was an ANOVA
with only one observation per cell the MS for interaction is the error variance
component. Incidentally, while variance components cannot be negative,
estimates of such components can be negative due to sampling error. If any variance
estimate is negative, it should be set equal to zero. (An example of such a
possibility is given in the third example.)
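
A minimal sketch of this estimation rule follows, using the mean squares of Example 1 (24.50 for attributes, 5.83 for raters, 1.23 for the residual). The function name is illustrative; the divisor is the number of values entering each treatment mean (4 raters per attribute mean, 6 attributes per rater mean), and negative estimates are set to zero as described above.

```python
def variance_component(ms_effect: float, ms_error: float, n_per_mean: int) -> float:
    """Estimate a variance component from mean squares, truncating negatives at zero."""
    estimate = (ms_effect - ms_error) / n_per_mean
    return max(estimate, 0.0)


MS_A, MS_B, MS_E = 24.50, 5.83, 1.23  # mean squares from the Example 1 ANOVA

var_attributes = variance_component(MS_A, MS_E, n_per_mean=4)  # 4 raters per attribute mean
var_raters = variance_component(MS_B, MS_E, n_per_mean=6)      # 6 attributes per rater mean
var_error = MS_E                                               # residual MS estimates error

# A mean square smaller than the error mean square gives a negative raw
# estimate, which the rule above replaces with zero.
var_truncated = variance_component(1.0, 2.0, n_per_mean=4)
```

The attribute component reproduces the 5.82 obtained in the text; the rater component, computed by the same rule, comes to roughly .77.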

The estimate of each variance component and the percent of variance
accounted for in the experiment can be listed in the following diagram:

Variance Component        Variance Estimate        Percent of Variance

This is a fairly precise experiment with about 16% of the variance due
to "error" and thus 84% is predicted variance.
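
The percentages can be recovered from the variance estimates with a few lines. This is a sketch under the assumption that the three components are 5.82 (attributes), the rater component obtained by the same rule from the mean squares, (5.83 - 1.23)/6, and 1.23 (error); the rater figure is derived here rather than quoted from the text, and the names are illustrative.

```python
from typing import Dict


def percent_of_variance(components: Dict[str, float]) -> Dict[str, float]:
    """Express each variance estimate as a percentage of their total."""
    total = sum(components.values())
    if total <= 0:
        raise ValueError("total variance must be positive")
    return {name: 100.0 * value / total for name, value in components.items()}


components = {
    "attributes": 5.82,             # estimate given for factor A
    "raters": (5.83 - 1.23) / 6,    # derived by the same rule (assumption)
    "error": 1.23,                  # residual mean square
}
percents = percent_of_variance(components)
predicted = percents["attributes"] + percents["raters"]
```

Under these assumptions the error share rounds to 16% and the remaining 84% is split between attributes and raters, in line with the figures stated above.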

Now, we can estimate the coefficient of generalizability ($G^2$) for the experiment as follows:

$$\hat{G}^2 = \frac{\text{Estimated Universe Score Variance}}{\text{Estimated Observed Score Variance}} = \frac{\hat{\sigma}_A^2}{\hat{\sigma}_A^2 + \hat{\sigma}_e^2} = \frac{5.82}{5.82 + 1.23} = .83$$

[...] is an estimate of the lower bound of $G^2$.
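
The ratio of universe score variance to observed score variance is simple enough to check by machine. The sketch below uses the two estimates from this example; the function name `g_coefficient` is an illustrative choice, and the halved error component in the second call is a purely hypothetical value, included only to show how the coefficient responds when error variance shrinks.

```python
def g_coefficient(universe_variance, error_variance):
    """Estimated universe score variance over estimated observed score variance."""
    return universe_variance / (universe_variance + error_variance)


# Estimates from Example 1.
g2 = g_coefficient(5.82, 1.23)

# Hypothetical case: same universe score variance, error component cut in half.
g2_half_error = g_coefficient(5.82, 1.23 / 2)

print(f"{g2:.2f} {g2_half_error:.2f}")  # 0.83 0.90
```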

One interpretation of $G^2$ is that it is the proportion of universe score variance accounted for by knowledge of the observed score variance.

Another interpretation is as follows: if the experiment were to be repeated with another random sample of four raters from the same universe of raters, who then rate the same attributes, the squared correlation between the [...] would be .83. The closer [...] representative [...] of the universe of interest. Although a $G^2$ of .83 is quite respectable, it is well to ask of the [...] study [...] rater 3 is for the most part in [...] with the first [...] up in the interaction [...] inflate the error variance component. In order to move the coefficient of generalizability ($G^2$) closer to one, a reduction in this interaction would have to occur.

Example 2: The JUDGE Experiment. (A Decision (D) Study)

As our second example we use the experimental results of a somewhat novel application of decision theory to a Tactical Air Control System provided by the work of Miller, Kaplan and Edwards (1967, 1968). They have proposed a system called Judged Utility Decision Generator (JUDGE) for allocating air strike missions to requests in tactical air control environments. A key concept of JUDGE is that value judgments (estimation of military worth associated with requests) can be made directly and in real time by appropriately trained personnel, and that the system should, in principle, maximize the aggregate utility over all the dispatching decisions it makes. The actual details of how JUDGE works are not important for this discussion. What is important is that the system makes extensive use of human judgments which are computer assisted, and the advocates of this system claim that it might be a better system than that presently being used by the Air Force in tactical situations. In order to lend support to this notion these investigators conducted a laboratory study comparing JUDGE against a version of the currently used system called Direct Air Support Center (DASC). We have taken the liberty of using the data reported for this experiment to illustrate the basic concepts and interpretations of the theory of generalizability, and what is presented below should not be construed as an interpretation of the Miller et al. experiment. In the experiment, each of 14 subjects participated in both the JUDGE and DASC modes of tactical command, in four simulated air tactical command stations, and there were two replications of the experiment. Thus from an ANOVA design viewpoint this can be considered as a three-factor factorial design with Systems, Situations, and Subjects being the three factors of interest or, more specifically, a 2 (systems) x 4 (situations) x 14 (subjects) design with two (2) replications per cell (each subject performed twice). Two dependent variables were used in this study: a measure of efficiency and a measure of effectiveness.¹

Since the main purpose of this study was to decide which system, JUDGE or DASC, was the best, we will first consider the analysis of the results of the study as a D (decision) study. Table 3 presents the ANOVA summary of the results using the effectiveness measure as the dependent variable. (The results for the efficiency measure are similar and will not be presented.)

Insert  Table  3 about  here 


The  calculation  of  the  Expected  Mean  Squares  E(MS)  assumes  the  A and  B factors 
are  fixed  and  the  C (subjects)  and  Within  Replicates  are  random  factors. 

Again, once the expected mean squares are set down it is a matter of simple algebra to estimate the variance components of the factors and their interactions in the experiment. For example, the variance component for factor A (Systems, JUDGE vs. DASC) is estimated by

$$\hat{\sigma}_A^2 = \frac{MS_A - MS_{AC}}{112} = \frac{1834.33 - 9.89}{112} = 16.29$$

1. The data are presented in Appendix B of Miller, Kaplan and Edwards (1968).


Table 3

ANOVA Summary Table for the JUDGE Experiment

Source of        Degrees of    Sums of    Mean       Expected Mean
Variation        Freedom       Squares    Squares    Square E(MS)

A (Systems)          1         1834.33    1834.33    $\sigma_e^2 + 8\sigma_{AC}^2 + 112\sigma_A^2$
B (Situations)       3           21.70       7.23    $\sigma_e^2 + 4\sigma_{BC}^2 + 56\sigma_B^2$
C (Subjects)        13           63.89       4.91    $\sigma_e^2 + 16\sigma_C^2$
AB                   3           65.82      21.94    $\sigma_e^2 + 2\sigma_{ABC}^2 + 28\sigma_{AB}^2$
AC                  13          128.57       9.89    $\sigma_e^2 + 8\sigma_{AC}^2$
BC                  39          114.94       2.95    $\sigma_e^2 + 4\sigma_{BC}^2$
ABC                112           91.51       0.82    $\sigma_e^2$
TOTAL              223         2320.76

where $\hat{\sigma}_A^2$ is an estimate of the variance component $\sigma_A^2$, and $MS_A$ and $MS_{AC}$ are the mean squares for the A factor and the AC interaction respectively.

The estimate of each variance component and the percent of total variance accounted for in the experiment can be listed in the following diagram:


Variance Component    Variance Estimate    Percent of Variance

[rows for the remaining components are not legible in the source]
AB                          .68                  3
AC                         1.13                  5
ABC                         .93                  4
Error                       .82                  4
Total                     22.30                100

[...] However, the variance due to the different systems (JUDGE vs. DASC) [...] this factor was so strong, i.e., accounted for such a large percentage of variance, that we can say very little about the other factors. [...] this particular experiment indicates that JUDGE is a more effective [...] in a simulated tactical air control environment than DASC, we can turn our attention to determining the generalizability of JUDGE as contrasted to DASC. To do this, we will reanalyze the data for each system separately. This is valid since the initial design was completely crossed, i.e., all subjects participated in all conditions of the experiment.
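
The estimate for Systems can also be obtained mechanically by solving the E(MS) equations from the shortest row upward. The Python sketch below does this for the three rows of Table 3 that enter the estimate for factor A. The function name `solve_components` and the dictionary layout are assumptions made for illustration; the key "e" stands for the error term, whose mean square is the 0.82 shown in Table 3.

```python
def solve_components(mean_squares, ems):
    """Solve a set of E(MS) equations for the variance components.

    mean_squares -- {effect: observed mean square}
    ems          -- {effect: {component: coefficient in E(MS) of that effect}}
    """
    estimates = {}
    # Rows with fewer terms are solved first, so every component a longer
    # row depends on has already been estimated when it is needed.
    for effect in sorted(ems, key=lambda name: len(ems[name])):
        row = ems[effect]
        known = sum(coef * estimates[name]
                    for name, coef in row.items() if name != effect)
        estimates[effect] = (mean_squares[effect] - known) / row[effect]
    return estimates


mean_squares = {"A": 1834.33, "AC": 9.89, "e": 0.82}
ems = {
    "A": {"e": 1, "AC": 8, "A": 112},
    "AC": {"e": 1, "AC": 8},
    "e": {"e": 1},
}
estimates = solve_components(mean_squares, ems)
for name, value in estimates.items():
    print(name, round(value, 2))
```

Solving row by row makes explicit why the AC mean square, and not the residual, is subtracted from the Systems mean square: the two expectations differ only by the term in $\sigma_A^2$. No truncation is applied here because none of these three estimates is negative.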

Analysis of the JUDGE Experiment: A Generalizability (G) Study

As an illustration of how a generalizability study may proceed, we have reanalyzed the JUDGE experiment as two 2-factor experiments: one with the subjects operating under the DASC system and the other with the same subjects operating under the JUDGE system. Two separate ANOVAs were carried out with the results displayed in Tables 4 and 5.

Insert Tables 4 and 5 about here

In Tables 4 and 5, the A factor (situations) is considered fixed and the B factor (subjects) and within replicates are random effects. The appropriate expected mean squares are given in the last column of each table, and from these, estimates of the variance components and the percent of total variance can be calculated and are given under each table. Now since we wish to generalize to the population of subjects we can estimate a coefficient of generalizability for each of the two systems as follows:


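$$\hat{G}^2 = \frac{\hat{\sigma}_B^2}{\hat{\sigma}_B^2 + \hat{\sigma}_{AB}^2 + \hat{\sigma}_e^2}, \quad \text{A factor fixed}$$

$$\hat{G}^2(\text{DASC}) = \frac{1.27}{1.27 + 1.80 + 1.22} = .30$$

$$\hat{G}^2(\text{JUDGE}) = \frac{.36}{.36 + .20 + .42} = .37$$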
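
The path from mean squares to the two coefficients can be traced with a short Python sketch. It assumes the design described for Tables 4 and 5 (4 fixed situations, 14 random subjects, 2 replicates per cell) and uses the mean squares reported there. The function names are illustrative only, and small differences in the last digit relative to the tabled estimates are to be expected from rounding.

```python
def two_factor_components(ms_a, ms_b, ms_ab, ms_within, n_a, n_b, n_rep):
    """Variance component estimates for an A (fixed) x B (random) design."""
    def clip(value):
        # Negative estimates are attributed to sampling error and set to zero.
        return max(value, 0.0)

    return {
        "A": clip((ms_a - ms_ab) / (n_rep * n_b)),
        "B": clip((ms_b - ms_within) / (n_rep * n_a)),
        "AB": clip((ms_ab - ms_within) / n_rep),
        "e": ms_within,
    }


def g_squared(components):
    """Generalizability over the random factor B, with A fixed."""
    observed = components["B"] + components["AB"] + components["e"]
    return components["B"] / observed


dasc = two_factor_components(26.08, 11.42, 4.81, 1.21, n_a=4, n_b=14, n_rep=2)
judge = two_factor_components(3.09, 3.31, 0.81, 0.42, n_a=4, n_b=14, n_rep=2)
print(f"DASC  {g_squared(dasc):.2f}")
print(f"JUDGE {g_squared(judge):.2f}")
```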


TABLE 4

ANOVA Summary: DASC System

Source               Degrees of    Sums of    Mean      Expected Mean Square
                     Freedom       Squares    Square

A (Situations)           3          78.25     26.08     $\sigma_e^2 + 2\sigma_{AB}^2 + 28\sigma_A^2$
B (Subjects)            13         148.46     11.42     $\sigma_e^2 + 8\sigma_B^2$
AB                      39         187.70      4.81     $\sigma_e^2 + 2\sigma_{AB}^2$
Within replicates       56          68.10      1.21     $\sigma_e^2$
Total                  111         482.51

Variance Component    Variance Estimate    Percent of Total Variance

A                           .76                  15
B                          1.27                  25
AB                         1.80                  36
Error (e)                  1.22                  24
Total                      5.05                 100

TABLE 5

ANOVA Summary: JUDGE System

Source               Degrees of    Sums of    Mean      Expected Mean Square
                     Freedom       Squares    Square

A (Situations)           3           9.26      3.09     $\sigma_e^2 + 2\sigma_{AB}^2 + 28\sigma_A^2$
B (Subjects)            13          42.99      3.31     $\sigma_e^2 + 8\sigma_B^2$
AB                      39          31.75       .81     $\sigma_e^2 + 2\sigma_{AB}^2$
Within replicates       56          23.42       .42     $\sigma_e^2$
Total                  111         107.42

Variance Component    Variance Estimate    Percent of Total Variance

A                           .08                   7
B                           .36                  34
AB                          .20                  19
Error (e)                   .42                  40
Total                      1.06                 100

Thus we see that the JUDGE system has a higher coefficient of generalizability and thus can be considered as more dependable (reliable and valid) than the DASC system. We need not seek nor rely on some outside independent criterion to help us reach this decision. We can also see that the DASC system is not that much worse than JUDGE with respect to its generalizability for subjects, being only seven percent "poorer". This was due primarily to the fact that the subjects x situations interaction variance component was higher for the DASC system than for the JUDGE system. This interaction term gets included in the estimate of the total observed score variance. Also, what it means from a behavioral standpoint is that the subjects when in the DASC mode were not being as consistent in their responses to the four situations as when performing in the JUDGE mode. It should be remembered that the tasks for the subjects are different in the two systems. In DASC the subjects are asked to make dispatching-like decisions in the simulated tactical situations whereas in JUDGE they are making value judgments (estimation of military worth associated with requests for a mission). These value judgments are expressed numerically and combined with an estimated probability of "kill" along with certain constraints such as the availability of aircraft, and the dispatching decision is made automatically by a computer-generated dispatching rule. The data presented in Tables 4 and 5 indicate that when subjects are asked to make value judgments and these judgments are then used in an automatic algorithm, the entire system responds more consistently to various situations. With this design the presence of interaction effects tends to reduce the generalizability over any set of conditions.

One final point before leaving this example. Although the situations factor was fixed in this analysis, it does appear that this was a stronger independent variable when the subjects were performing under the DASC system (15% of the total variance) than when they were performing under the JUDGE system (7% of the total variance). This might suggest that in a future G study in which generalizability to situations is a desired goal, the DASC system might fare better than JUDGE. Actually no such prediction can be made until the actual G study is performed with the situations variable included in the design of the study as a random factor. However, it is often the case that generalizations to situations, stimuli, tasks, and so on are as important as, if not more important than, generalizations about people. An illustration of such a case is given in the next numerical example.

Gardiner [...], in his study applying multi-attribute utility techniques to public policy decisions, had subjects make judgments about whether certain permit requests for various developments along the Southern California coast should be approved or disapproved by a Coastal Commission which has the authority to approve or deny such requests. The subjects, some of whom were Coastal Commission members, made "wholistic" judgments and also made value or worth judgments about each permit along eight different [...] characterizing [...], which included [...] the height of the proposed development, [...] from the water's edge [...] decisions. Gardiner took special pains to select a sample of 15 permits [...] of the kind that usually come before the Coastal [...]. He was interested in comparing two sub-groups of his subjects who described themselves as "Developers", i.e., generally leaning toward development of the coastline, and "Conservationists", who were generally opposed to developments that might destroy the natural coastline. One phase of his analysis utilized a two-factor ANOVA, with the groups (Developers vs. Conservationists) and permits being the two factors. He had 15 permits and 7 subjects in each group; thus this was a 2 x 15 factorial design with 7 replications per cell. The results are given in Tables 6 and 7. Table 6 is the result for the wholistic evaluation of permit worth and Table 7 is for the MAU evaluation of permit worth.


Insert  Tables  6 and  7 about  here 


In these two tables the group factor (Developers vs. Conservationists) is considered fixed and the permit factor and within replicates are considered to be random.

The coefficients of generalizability ($G^2$) for the two analyses given in Tables 6 and 7 are:

$$\hat{G}^2 = \frac{\hat{\sigma}_B^2}{\hat{\sigma}_B^2 + \hat{\sigma}_{AB}^2 + \hat{\sigma}_e^2}, \quad \text{A factor fixed}$$

$$\hat{G}^2(\text{wholistic}) = \frac{274.53}{274.53 + 157.94 + 431.35} = .32$$

$$\hat{G}^2(\text{MAU}) = \frac{124.78}{124.78 + 0 + 106.09} = .54$$


Thus,  the  MAU  technique  is  more  dependable  (generalizable)  than  its 
wholistic  counterpart.  In  other  words,  if  one  were  to  generalize  to  the  domain 
or  population  of  all  possible  permits  (of  the  kind  investigated  in  Gardiner's 
study)  then  one  would  be  in  a better  position  using  MAU  than  wholistic  judgments. 
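
The role of the zeroed interaction estimate in the MAU analysis can be seen in a short Python sketch. It starts from the mean squares and E(MS) coefficients of Table 7 (7, 14 and 105), truncates any negative estimate, and then reports each component as a percent of the total. The function name `percent_of_variance` is an illustrative choice, and the computed values may differ from the tabled ones in the last digit because of rounding.

```python
def percent_of_variance(raw_estimates):
    """Truncate negative estimates at zero and express each as a percent."""
    clipped = {name: max(value, 0.0) for name, value in raw_estimates.items()}
    total = sum(clipped.values())
    shares = {name: 100.0 * value / total for name, value in clipped.items()}
    return clipped, shares


# Raw estimates from the MAU mean squares (A = group, B = permits).
raw = {
    "A": (2128.51 - 77.63) / 105,
    "B": (1853.00 - 106.09) / 14,
    "AB": (77.63 - 106.09) / 7,  # negative before truncation
    "e": 106.09,
}
components, shares = percent_of_variance(raw)
g2_mau = components["B"] / (components["B"] + components["AB"] + components["e"])

for name in raw:
    print(f"{name:>2}  {components[name]:8.2f}  {shares[name]:5.1f}%")
print(f"G^2 (MAU) = {g2_mau:.2f}")
```

Because the interaction estimate contributes nothing after truncation, the observed score variance for the MAU method consists of the permit component and error alone.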


TABLE 6

ANOVA Summary: Wholistic Evaluation of Permit Worth

Source               Degrees of    Sums of       Mean         Expected Mean Square
                     Freedom       Squares       Square

A (Group)                1          13,675.24    13,675.24    $\sigma_e^2 + 7\sigma_{AB}^2 + 105\sigma_A^2$
B (Permits)             14          59,845.38     4,274.67    $\sigma_e^2 + 14\sigma_B^2$
AB                      14          21,516.18     1,536.87    $\sigma_e^2 + 7\sigma_{AB}^2$
Within replicates      180          77,643.00       431.35    $\sigma_e^2$
Total                  209         172,679.80

Variance Component    Variance Estimate    Percent of Total Variance

A                         115.60                 12
B                         274.53                 28
AB                        157.94                 16
Error (e)                 431.35                 44
Total                     979.42                100

TABLE 7

ANOVA Summary: MAU Technique Evaluation of Permit Worth

Source               Degrees of    Sums of       Mean        Expected Mean Square
                     Freedom       Squares       Square

A (Group)                1           2,128.51    2,128.51    $\sigma_e^2 + 7\sigma_{AB}^2 + 105\sigma_A^2$
B (Permits)             14          25,942.00    1,853.00    $\sigma_e^2 + 14\sigma_B^2$
AB                      14           1,086.82       77.63    $\sigma_e^2 + 7\sigma_{AB}^2$
Within replicates      180          19,096.20      106.09    $\sigma_e^2$
Total                  209          48,253.53

Variance Component    Variance Estimate    Percent of Total Variance

A                          19.54                  8
B                         124.78                 50
AB¹                         0                     0
Error (e)                 106.09                 42
Total                     250.41                100

1. The estimate of the variance component ($\hat{\sigma}_{AB}^2$) for the AB interaction was negative due to sampling error and was set to zero.


Note that the distribution of predicted variance is quite different for the two methods, with the most dramatic difference being the elimination of the interaction component when the subjects' responses are used under the MAU technique. The statements made earlier are worth repeating here: when human judgment is being used in any scientific or practical study, any interaction between the human judges and the objects, conditions or persons being judged may be a form of inconsistency and lowers the dependability (generalizability) of those judgments. There may be an important principle here, one of considerable theoretical and practical importance. Any "divide-and-conquer" technique such as a MAU technique may minimize or at least substantially reduce interaction sources of variance due to inconsistency, thus making any study or application of the technique easier to interpret. This is not to say that components of variance due to interaction should always be reduced. There are certainly situations in which individual differences represent valid differences in judgments about utilities. Arch conservationists may have quite different ideas about what is "best" for the California coastline than arch developers. We certainly would not want a technique that blurs or reduces such differences. The theory described in this paper must be applied to such situations in laboratory and "field" studies to see how useful the theory is in such situations.

Comment on Random Sampling

The theory of generalizability makes one powerful assumption: any sample of observations must be a representative random sample from the universe or population one wishes to generalize to. The question immediately arises as to whether one should truly use the operation of complete random sampling of conditions from some universe in order to generate the set of conditions to be used in any G study. Presumably Brunswik (1956) would argue yes, but I would argue that it is not a necessity. The choice of the levels of a factor in an ANOVA design must rest with the investigator, and it is the responsibility of that investigator to state whether that factor is random or fixed for any given situation and give his reasons, which other investigators may or may not agree with. It may be that random sampling is the best way to choose the levels as, for example, in the study of form perception using methods suggested by Attneave (1954). In selecting what development permits to use in his study, Gardiner could have used some random sampling scheme, such as going to the files of proposed permits in the California Coastal Commission's office and selecting his 15 permits by some random plan. However, this runs the risk, with a fairly high probability, of yielding a set of 15 permits that is not as representative of the universe of permits as it should be. What Gardiner did was to rely on expert judgment (his own) to select a list of permits that would be useful in his study. This list covered a broad range of typical permits that included almost all of the kinds of proposed developments of interest to the study and the practical application in mind.

The assumption of random sampling in any G or D study should remain that: an assumption on the part of the investigator. Of course all the principles involved in good experimental design should be employed to make that assumption reasonable and plausible.

Summary 

This paper has presented a theoretical rationale for assessing the dependability, validity, reliability or intellectual coherence of multi-attribute utility models and techniques. If an investigator is advocating the use of a MAU model or procedure he or she is interested in generalizing from observations at hand to a universe or domain of observations that are members of that same universe. The universe must be unambiguously defined but it is not necessary to assume that universe as having any statistical properties such as uniform variances or covariances. A study of generalizability (G study) is conducted by taking measurements on persons, stimuli, tasks, etc. that are assumed to be randomly representative of a universe an investigator wishes to generalize to. The ratio of an estimate of the universe "score" variance to an estimate of the observed score variance is the coefficient of generalizability. This is estimated by the intra-class correlation coefficient. ANOVA and the Expected Mean Square paradigm of Cornfield and Tukey are used to obtain the appropriate variance estimates.

The theory dispenses with unnecessary and unwarranted assumptions, and eliminates the distinction between reliability and validity. Any G study can be conducted without reference to having a parallel measure of the MAU instrument or some external criterion of "success". If a MAU technique is compared to some non-MAU technique for doing the same thing then it is possible to calculate the coefficient of generalizability for both methods, thus allowing the investigator to decide which is best for his or her purposes. Preliminary investigations have indicated that MAU models and techniques based on such models may be "better" than non-MAU models since the former have a tendency to reduce the interaction between judges and the thing being judged when such interaction represents inconsistency of judgment. The extent of this principle, if indeed it is true at all, needs further work.

Appendix A

An  Expected  Mean  Square  Algorithm 

The theory of generalizability requires an investigator to estimate variance components. This in turn requires the calculation of the expected value of the mean squares (MS) generated by an analysis of variance (ANOVA) study. The calculation of these expected mean squares E(MS) is straightforward but involves tedious algebra. Fortunately, Cornfield and Tukey (1956) have provided a convenient algorithm for generating the E(MS)s for any ANOVA design. This procedure is explained in standard texts such as Winer (1971) and Kirk (1968).

The procedure is illustrated below by following a set of rules. This is a modification of that provided by Kirk (1968, p. 209-210).

Rule 1. Write the linear model for the design. If, for example, there are two factors A, B and n replications the model is:

$$Y_{ijm} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \varepsilon_{m(ij)}$$


Rule  2.  Construct  a two  way  table  such  as  Table  A as  follows: 

(a)  The  rows  of  the  table  are  labeled  as  the  factor  effects 
excluding  the  general  mean.  The  columns  of  part  1 of  the  table 
are  labeled  with  the  subscripts  and  the  limit  of  the  subscript 
(number  of  levels  of  each  factor). 

(b)  Part  2 of  the  table  is  labeled  as  E(MS) 


TABLE A

Expected Values of Mean Squares for a Two Factor ANOVA Design

Rule 3. Each entry below each column in part 1 is determined as follows:

(a) If the column heading appears as a subscript of a row term, enter the sampling fraction appropriate for that column, $1 - \frac{a}{A}$, $1 - \frac{b}{B}$, etc., where a and b are the numbers of levels of each factor used in the experiment and A and B are the total numbers of possible levels.

(b) If the column heading does not appear as a subscript of a row term, enter the appropriate letter for that column, e.g., a, b, n, etc., in the row.

(c) The last row should contain all ones under each column heading. For most designs there is no sampling fraction for the replicates effect, since the number of replicates n for any experiment is usually very small relative to the total number of possible replicates, i.e., $(1 - \frac{n}{N}) = 1$ for large N.

$(1 - \frac{n}{N}) = 1$ since in almost all applications the number of replications n is considered very small relative to all possible replications N.


Rule 4. For each row in part 2 of the table (E(MS)) list the variances of the linear model terms that contain all the subscripts of the row term. For example, the subscript of the first row is i. Variances of terms in the linear model that contain subscript i are $\sigma_e^2$, $\sigma_{AB}^2$, and $\sigma_A^2$.

Rule 5. The coefficients of the variances for each E(MS) are obtained by covering up the columns headed by the subscripts that appear in a row and multiplying each row E(MS) variance by the remaining terms in part 1 of the table. For example, the coefficients in the first row for $\sigma_A^2$ are n and b, which are found in the first row of the table. The coefficients for $\sigma_{AB}^2$ are n and $(1 - \frac{b}{B})$, which are found in the third row of the table. The coefficient for $\sigma_e^2$ is always 1.

The E(MS) for any main effect always includes the error variance $\sigma_e^2$ plus all variance terms in which it is included. In other words the E(MS) is a weighted sum of all the variance components that contain the subscripts of the main effects.

Rule 6. The sampling fractions $(1 - \frac{a}{A})$, $(1 - \frac{b}{B})$, etc. tend to reduce the variance terms for which they are coefficients and suppress them completely when the factor is fixed. For example, if factor A is fixed and thus a exhausts all levels of interest, a = A and $(1 - \frac{a}{A}) = 0$. If the factor is considered random and a is small relative to A, the sampling fraction is one. There may be practical situations in which values for the sampling fractions between 0 and 1 may be appropriate, but these two values are most often used.
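
The rules above lend themselves to a short program. The following Python sketch is one possible reading of Rules 2 through 6 for a fully crossed design with n replicates per cell; it is an illustration under that assumption, not a transcription of the Cornfield and Tukey procedure. The function name `expected_mean_squares`, its arguments, and the use of "e" for the error term are all illustrative choices.

```python
from itertools import combinations
from math import prod


def expected_mean_squares(levels, fraction, n):
    """Apply Rules 2-6 to a fully crossed design with n replicates per cell.

    levels   -- {factor: number of levels used}, e.g. {"A": 2, "B": 4}
    fraction -- {factor: sampling fraction 1 - a/A}; 0 if fixed, 1 if random
    n        -- number of replicates per cell
    Returns {effect: {component: coefficient}}, with "e" for error.
    """
    names = list(levels)
    # Rows of the table: every main effect and interaction (Rule 2).
    effects = [frozenset(c) for r in range(1, len(names) + 1)
               for c in combinations(names, r)]

    def label(effect):
        return "".join(sorted(effect))

    table = {}
    for row in effects:
        terms = {"e": 1}  # the coefficient of the error variance is always 1
        for term in effects:
            if not row <= term:
                continue  # Rule 4: keep terms containing all row subscripts
            # Rule 5: cover the columns named in the row, multiply the rest.
            # Rule 3: an uncovered column holds the sampling fraction if it
            # is a subscript of the term, otherwise its number of levels.
            coefficient = n * prod(
                fraction[f] if f in term else levels[f]
                for f in names if f not in row)
            if coefficient != 0:  # Rule 6: fixed factors suppress terms
                terms[label(term)] = coefficient
        table[label(row)] = terms
    table["e"] = {"e": 1}
    return table


# Design of Table 3: Systems (A) and Situations (B) fixed, Subjects (C) random.
ems = expected_mean_squares({"A": 2, "B": 4, "C": 14},
                            {"A": 0, "B": 0, "C": 1}, n=2)
for effect, terms in ems.items():
    print(effect, terms)
```

For this design the rows for A, B, C, AB, AC and BC reproduce the coefficients shown in the last column of Table 3 (for example 8 and 112 for Systems). Calling the function with four situations fixed, fourteen subjects random and n = 2 gives the coefficients 2, 8 and 28 used in Tables 4 and 5.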


